Horizon Scanning & Data Science

Horizon Scanning, UCL

Bennett Kleinberg

19 Feb 2019

Horizon scanning and Data Science

Before we start

How tall am I (in cm)?

Today

  • Data collection
  • Information extraction
  • Crowdsourcing and “wisdom of the crowd”
  • Hype and hope

What’s the point?

  1. Problem of scale
  2. Problem of weak signals

Earthquakes and piranhas

Geller, 1999, 538 article

see Andrew Gelman’s post

“I would not be at all surprised if earthquakes are just practically, inherently unpredictable.”

(Ned Field)

Horizon scanning and data science

It all starts with the data.

It all starts with a problem.

Problem for today

Where are the weak signals? (data)

How can we deal with the information overload? (information extraction)

Data collection

Where are the weak signals?

  • newspaper coverage
  • patents

Newspaper coverage

Idea: weak signals in newspaper coverage

Newspaper coverage

Many newspapers have an Application Programming Interface (API):

  • offers structured access to the data
  • requires access credentials
  • pro: stable and structured
  • con: under the platform’s control

more on APIs and webscraping –> here

Newspaper coverage

Example: Guardian coverage on drill music

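library(GuardianR)  # assumed source of get_guardian()

# Guardian articles containing the exact phrase "drill music" (URL-encoded)
get_guardian("%22drill+music%22",
             from.date = "2018-02-01",
             to.date = "2018-03-01",
             api.key = "YOUR_ACCESS_KEY")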
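Without the wrapper, the same request is just a URL. The sketch below only assembles that URL and sends nothing. It assumes the Guardian Open Platform search endpoint and its `q`, `from-date`, `to-date`, `show-fields` and `api-key` parameters; `build_guardian_url` is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: assemble a Guardian search URL (no request is made).
build_guardian_url <- function(query, from_date, to_date, api_key) {
  params <- c(
    "q"           = utils::URLencode(query, reserved = TRUE),  # encode quotes and spaces
    "from-date"   = from_date,
    "to-date"     = to_date,
    "show-fields" = "body,wordcount",   # assumed field names for article text and length
    "api-key"     = api_key
  )
  paste0("https://content.guardianapis.com/search?",
         paste(names(params), params, sep = "=", collapse = "&"))
}

url <- build_guardian_url('"drill music"', "2018-02-01", "2018-03-01",
                          "YOUR_ACCESS_KEY")
print(url)
```

The quotes around the phrase are what restrict the search to the exact expression; once encoded they become the `%22` seen in the wrapper call.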

Drill music in the Guardian

Variables:

  • title
  • date
  • word count
  • article

Example

## [1] "Is UK drill music really behind London's wave of violent crime?"
## [1] "2018-04-09T17:09:31Z"

## [1] "<p>Until I was about 12 years old my only offence was playing ball games where it said “No ball games allowed”. But that all changed after the first time I witnessed someone being shot on the Myatts Field estate in Brixton, south London where I lived with my family. After firing his gun, the shooter ran towards me and my friends, took off his jumper, put it by one of our makeshift goalposts and told us to keep playing. Processing these sort of events at such a young age was traumatic. Four years on, I was heavily involved in gangs. By the age of 16 I had been shot at, cut on the face and stabbed in the chest, and one of my best friends had been killed, just a couple days before our GCSE exams.</p> <aside class=\"element element-rich-link element--thumbnail\"> <p> <span>Related: </span><a href=\"https://www.theguardian.com/society/2018/apr/17/faith-groups-crucial-role-tackling-knife-crime\">Could faith groups play a crucial role in tackling knife crime?</a> </p> </aside>  <p>I had strayed completely off the path my parents had intended for me. Criminal activity was an everyday thing: I would be armed on my way to the local chicken shop with friends. The radical change in my personal identity was alarming even to me. I would sometimes reflect on how far removed I had become from my previous morals. I was extremely fortunate that <a href=\"https://www.theguardian.com/society/2018/apr/17/faith-groups-crucial-role-tackling-knife-crim\">Pastor Mimi Asher’</a>s son was a close friend of mine; we were part of the same gang. She was desperate for her son to escape the clutches of the gangs. To do so she realised in order for her efforts to be effective she would have to reach his friends too. She opened her home, allowing it to become a sort of informal therapeutic community rehabilitation hub. During this time I was shot at outside her house and the bullet went through her front door. 
Yet she continued her offensive against what she called our true enemy, the ideology of “gangsterism”. Her counselling and Bible-based intervention work led me to denounce my gang involvement and turn my life around. Through her holistic approach and spiritual teachings I was able to claim back my true identity and strive towards excellence.</p>  <aside class=\"element element-pullquote element--supporting\"> <blockquote> <p>Feeling displaced and hopeless causes young men and women to squander their potential on street ambitions</p> </blockquote> </aside>  <p>A little more than a decade later, I volunteer at the youth charity, Youth in Action, which runs alongside Pastor Mimi’s church. I offer young people the mentoring and support that she once offered me.</p> <p>But gang culture isn’t confined to a few bad estates any more. It’s <a href=\"https://www.theguardian.com/society/2018/jan/31/british-gangs-using-violence-to-groom-children-as-drug-mules\">epidemic in and around London</a>, weighing heaviest on those living on council estates. The government has acknowledged that we have a problem. But it needs to mature in its response to surges in <a href=\"https://www.theguardian.com/media/2018/apr/02/social-media-violence-young-people-gangs-say-experts\">youth violence</a>. The textbook approach of increased stop and search and harsher penalties will only address the symptoms of knife crime; it will not solve it. The <a href=\"https://www.london.gov.uk/sites/default/files/london_lost_youth_services_sian_berry_jan2017.pdf\">reduction in youth services budgets by £22m across the capital since 2011</a> has most definitely been a huge blow to efforts to tackle the problem. 
Gang culture is a byproduct of the fractures in society; it’s not created by Instagram uploads and <a href=\"https://www.theguardian.com/commentisfree/2018/apr/10/drill-music-police-cuts-knife-crime-teenagers\">UK grime rappers</a>.</p> <p>More support needs to be given to interventions that have had some success, helping them to have an even greater impact. If Youth in Action had a permanent base that was fit for purpose, its the robust rehabilitation work I received in Pastor Mimi’s council estate house some years ago could be extended to more young men and women in the community.</p> <aside class=\"element element-rich-link element--thumbnail\"> <p> <span>Related: </span><a href=\"https://www.theguardian.com/society/2012/aug/21/gang-an-addiction-like-any-other\">Being in a gang is an addiction like any other</a> </p> </aside>  <p>The mentors have the understanding and insights to make a real impact, but we are caught in a frustrating limbo of being acknowledged as a key player yet not being given a space that will allow us to do what we do best. There are many derelict buildings in the area, but the council tends to sell them to private developers to build residences that most of the community can’t afford to live in. We are told that the planning permission we need to save lives may not be possible. We could potentially help thousands rather than a couple of hundred young people a week.</p> <p>I am proof that, with the right support, lives can be turned around. We should understand that if anyone is exposed to constant threats and traumas there is the potential for them to go in the wrong direction. Fear is what fuels such lifestyles; feeling displaced and hopeless causes young men and women to squander their potential on street ambitions. Let’s help them help themselves rather than brand them feral and unreachable. The hard to reach are still within our reach.</p>"
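The article field arrives as raw HTML, so it usually needs cleaning before any text analysis. A minimal sketch with base R regular expressions follows; `strip_html` is a hypothetical helper, and dropping every `<aside>` block (related links, pull quotes) is an assumption about what counts as article text. A proper HTML parser would be more robust.

```r
# Hypothetical helper: reduce an HTML article body to plain text.
strip_html <- function(x) {
  x <- gsub("<aside.*?</aside>", " ", x, perl = TRUE)  # drop "Related" boxes and pull quotes
  x <- gsub("<[^>]+>", " ", x)                          # drop all remaining tags
  trimws(gsub("\\s+", " ", x))                          # collapse runs of whitespace
}

# Shortened snippet in the same shape as the API output above.
raw_html <- paste0(
  '<p>I am proof that, with the right support, lives can be turned around.</p> ',
  '<aside class="element element-rich-link"> <p> <span>Related: </span>',
  '<a href="https://www.theguardian.com/society/2012/aug/21/gang-an-addiction-like-any-other">',
  'Being in a gang is an addiction like any other</a> </p> </aside> ',
  '<p>The hard to reach are still within our reach.</p>'
)

clean_text <- strip_html(raw_html)
print(clean_text)
```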

More newspaper sources

  • Associated Press
  • BBC
  • Bloomberg
  • Reuters
  • (list of APIs here)

Why newspapers?

  • might offer initial hints (e.g. knife crime & drill music association)
  • offer rich accounts of events
  • but: curated!
  • possible for future: comment sections

Patents

Idea: weak signals in patent filings

  • patents as early indicators of trends
  • by definition “new”

Patents

Case study: 3-D printing

CPC classification of patents (Espacenet)

Accessing patent data

Running an API request

Through the browser:

http://www.patentsview.org/api/patents/query?q={
  "_and":[
    {"_gte":{"patent_date":"2009-01-01"}},
    {"cpc_subgroup_id":"B29C64/00"}
  ]
}

Running an API request

Through an API wrapper:

library(patentsview)  # assumed source of search_pv()

# `query` and `fields` are defined beforehand
pv_3d_print = search_pv(query = query
                        , fields = fields
                        , all_pages = TRUE)
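Both routes send the same JSON query. As a sketch of the browser route in base R, the query can be written as a string and percent-encoded into the URL. The endpoint is copied from the slide and may no longer be live; nothing is requested here, and the variable names are illustrative.

```r
# Query from the browser example: patents dated 2009-01-01 or later
# in CPC subgroup B29C64/00.
query_json <- paste0(
  '{"_and":[',
  '{"_gte":{"patent_date":"2009-01-01"}},',
  '{"cpc_subgroup_id":"B29C64/00"}',
  ']}'
)

# Percent-encode braces, quotes and slashes so the query survives as a URL parameter.
request_url <- paste0("http://www.patentsview.org/api/patents/query?q=",
                      utils::URLencode(query_json, reserved = TRUE))
print(request_url)
```

A browser does this encoding silently, which is why the readable version works when pasted into the address bar.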

Retrieved patents data

  • patent number
  • patent year
  • patent abstract
  • assignee info
    • country, long-lat, city
    • number of inventors, number of patents
  • CPC metadata (group, subgroup)

Example

pv_3d_print$data$patents$patent_abstract[1]
## [1] "A method of manufacturing a three-dimensional object facilitates removal of the three-dimensional object from the platen on which the object was formed. The method includes rotating the platen from a horizontally level position to a position at an angle to the level position to enable gravity to urge the three-dimensional object away from the platen and inductively heating the platen to melt support material at the boundary of the object and the platen to release the three-dimensional object from the platen."

Patents

Case study: Weapons

  1. Search CPC classification
  2. Define search query
  3. Retrieve data

1. Search CPC classification

https://worldwide.espacenet.com/…

2. Define search query

  • general (overarching category): “F41”
  • patent date from 2014-01-01 onwards

http://www.patentsview.org/api/patents/query?q={
  "_and":[
    {"_gte":{"patent_date":"2014-01-01"}},
    {"cpc_subgroup_id":"F41"}
  ]
}

3. Retrieve data

pv_weapons = search_pv(query = query
                       , fields = fields
                       , all_pages = TRUE)

Example

## [1] "A vehicle or machine suitable for operation on uneven or inclined surfaces, such as a forest work unit, comprises at least three frame parts and two rotational planes. Each frame part has rotational planes, in each particular case, on an interface between two successive frame parts of the vehicle or machine. The rotational planes are, in each particular case, planes perpendicular to the longitudinal axis of the vehicle or machine in the neutral position. The frame parts are thus arranged to be independently rotatable in relation to the rotational plane."

Resources

Google Toolbox example

Information extraction

How can we deal with the information overload?

  • frequency analysis
  • information extraction (NLP)

Intermezzo

The language challenge?

Possible approaches

  • token occurrence
  • sentiment
  • Keywords-in-context
  • Co-occurrences

Token occurrence

Useful for eyeballing, but largely uninformative.
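A minimal sketch of token counting in base R, using two made-up sentences. The tokeniser (lowercase, split on anything that is not a letter) is a deliberate simplification.

```r
# Toy corpus (hypothetical sentences, not taken from the Guardian data).
texts <- c(text1 = "Drill music is blamed for gang violence.",
           text2 = "Gang members use drill music videos.")

# Lowercase and split on every run of non-letters.
tokens <- unlist(strsplit(tolower(texts), "[^a-z]+"))
tokens <- tokens[tokens != ""]

# Frequency table, most frequent tokens first.
freq <- sort(table(tokens), decreasing = TRUE)
print(head(freq, 5))
```

The counts say which words are frequent but nothing about how they are used, which is where the other approaches come in.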

Sentiment analysis: aim

  • measure positive/negative tone
  • “emotionality” of a text
  • builds on the “language -> behaviour” and “cognition -> language” nexus

Basics of sentiment analysis

  1. tokenise text
  2. construct a lexicon of sentiment words
  3. judge the sentiment words
  4. match tokens with sentiment lexicon
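The four steps can be sketched in a few lines of base R. The lexicon and its scores are hypothetical toy values; real lexicons contain thousands of judged words, and `sentiment_score` is an illustrative name.

```r
# Steps 2-3: a tiny hand-made lexicon with judged scores (hypothetical values).
lexicon <- c(traumatic = -1, killed = -1, hopeless = -1,
             fortunate = 1, success = 1, excellence = 1)

sentiment_score <- function(text, lexicon) {
  # Step 1: tokenise (lowercase, split on anything that is not a letter).
  tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))
  # Step 4: look tokens up in the lexicon; tokens without an entry give NA.
  scores <- lexicon[tokens]
  sum(scores, na.rm = TRUE)
}

print(sentiment_score("I was extremely fortunate.", lexicon))
print(sentiment_score("It was traumatic; a friend had been killed.", lexicon))
```

Summing scores ignores negation ("not fortunate") and text length, so this is a starting point rather than a finished measure.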

Sentiment analysis

  • shifting units of analysis
  • single texts, sentences, dynamic approaches
  • current idea: sentiment is dynamic within text

Sentiment trajectories

(e.g. our EMNLP 2018 paper)
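One simple way to make sentiment dynamic is to score each token and smooth the scores with a moving window. The sketch below does that with a centred moving average; it is an illustration of the idea, not the method of the paper, and the lexicon, sentence and window size are all assumptions.

```r
# Hypothetical toy lexicon.
lexicon <- c(shot = -1, stabbed = -1, killed = -1, traumatic = -1,
             fortunate = 1, support = 1, proof = 1, help = 1)

sentiment_trajectory <- function(text, lexicon, window = 5) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))
  tokens <- tokens[tokens != ""]
  scores <- unname(lexicon[tokens])
  scores[is.na(scores)] <- 0   # tokens outside the lexicon count as neutral
  # Centred moving average; positions too close to either end are NA.
  as.numeric(stats::filter(scores, rep(1 / window, window), sides = 2))
}

text <- paste("I was shot at and stabbed and a friend was killed but with",
              "the right support I am proof that people can help")
traj <- sentiment_trajectory(text, lexicon, window = 5)
print(round(traj, 2))
```

The trajectory starts negative and ends positive, a shape that a single score for the whole text would average away.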

Sentiment for single articles

Sentiment trajectories for all articles

Keywords in context

docname | pre                            | keyword | post
text2   | . They are seen as             | gang    | members , hardened criminals ,
text3   | series of incidents in which   | gang    | violence was supposedly catalysed by
text3   | platform which can provide the | gang    | and / or gang members
text3   | the gang and / or              | gang    | members with a sense of
text3   | 67 dress , flirting with       | gang    | imagery with matching black sportswear
text5   | were part of the same          | gang    | . She was desperate for
text5   | led me to denounce my          | gang    | involvement and turn my life
text5   | once offered me . But          | gang    | culture isnt confined to a

Keywords in context

row | docname | pre                                  | keyword | post
11  | text3   | violence was supposedly catalysed by | drill   | music MCs taunted each other
12  | text3   | referring to knife crime ,           | drill   | DJ Bempah argued : if
13  | text3   | who helped pioneer the US            | drill   | sound . Photograph : Michael
14  | text3   | of the way that UK                   | drill   | is networked via social media
15  | text3   | there are valid worries that         | drill   | is not just reflecting criminality
Co-occurrences

load('guardian_fcm_g.RData')
knitr::kable(fcm_g[1:6, 1:6])
document   | unusual | teenagers | london | excluded | school | possession
unusual    |       0 |         3 |      6 |        1 |      2 |          1
teenagers  |       0 |        13 |    227 |       17 |     77 |          4
london     |       0 |         0 |   1098 |      101 |    562 |         17
excluded   |       0 |         0 |      0 |       11 |     71 |          1
school     |       0 |         0 |      0 |        0 |    107 |          3
possession |       0 |         0 |      0 |        0 |      0 |          0
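A co-occurrence matrix of this shape can be sketched in base R from a document-term matrix. The three toy documents are hypothetical, and the counting rule (terms appearing in the same document) is an assumption that may differ from how `fcm_g` was built.

```r
# Toy tokenised documents (hypothetical).
docs <- list(
  c("teenagers", "london", "school"),
  c("london", "school", "excluded"),
  c("teenagers", "london")
)
vocab <- sort(unique(unlist(docs)))

# Document-term matrix: one row per document, one column per term.
dtm <- t(vapply(docs,
                function(d) as.numeric(table(factor(d, levels = vocab))),
                numeric(length(vocab))))
colnames(dtm) <- vocab

cooc <- crossprod(dtm)          # term-by-term counts of shared documents
cooc[lower.tri(cooc)] <- 0      # keep only the upper triangle, as in the table above
print(cooc)
```

Each off-diagonal cell counts the documents in which both terms occur; such a matrix can serve as the edge list of a text network.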

Text network

Case study patents

  • use patents data to identify trends
  • patents per category over time
  • example: weapons (pistols vs armour)
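Counting patents per category and year is a cross-tabulation. The records below are hypothetical toy values for illustration only; real input would be the retrieved patents data with a year and a category column.

```r
# Hypothetical toy records (not real patent counts).
patents <- data.frame(
  year     = c(2014, 2014, 2015, 2015, 2015, 2016, 2016, 2016, 2016),
  category = c("pistols", "armour", "pistols", "armour", "armour",
               "pistols", "armour", "armour", "armour"),
  stringsAsFactors = FALSE
)

# Patents per category (columns) and year (rows).
counts <- table(patents$year, patents$category)
print(counts)

# Year-on-year change per category as a crude trend signal.
changes <- apply(counts, 2, diff)
print(changes)
```

In this toy example one category grows steadily while the other stays flat; whether such a difference is a signal or noise is exactly the evaluation problem raised later.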

Case study TechCrunch

  • all headings of TechCrunch and VentureBeat for 2017
  • ~23k headings
  • e.g. “MyTomorrows raises further 10M to help access drugs in development”

Might help identify “tech trends”

Non text-based data

“Traditional” data

  • directly quantified
  • essentially directly usable
  • especially: price data, trading data

Crowdsourcing

Crowd intelligence

Maybe collectively, we are better than individually…

your_estimates_cm = ...

What does this tell you?

The parable of the ox, FT, 2012
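The height question can be scored in a few lines. The true value and the guesses below are hypothetical stand-ins, not the answers collected in the session.

```r
true_height_cm <- 180                                           # hypothetical true value
estimates_cm <- c(165, 172, 175, 178, 181, 183, 186, 190, 195)  # hypothetical guesses

crowd_error      <- abs(mean(estimates_cm) - true_height_cm)   # error of the averaged guess
individual_error <- abs(estimates_cm - true_height_cm)         # error of each single guess

cat("Crowd (mean) error:       ", crowd_error, "\n")
cat("Average individual error: ", mean(individual_error), "\n")
cat("Share of individuals beaten by the crowd:",
    mean(individual_error > crowd_error), "\n")
```

The error of the mean can never exceed the average individual error, because over- and underestimates cancel. The cancelling only helps when errors point in different directions: a crowd that shares the same bias stays wrong together.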

Boundaries of crowd intelligence

Hype and hope

Pitfalls and hype

  • long vs. wide data
  • unknown unknowns of trend places

Pitfalls and hype

Evaluation of horizon scanning

Pitfalls and hype

Assumptions, assumptions, assumptions

Pitfalls and hype

Beware of category mistakes

The hope

  • Until now: impossible task
  • Information overload for analysts
  • Data Science (esp. NLP) can address this
  • Machine learning (potentially) promising
  • Human-in-the-loop might help (note: ASSUMPTION)

Example: gang-violence

But

“Just gather more data!”

More data = better solutions

Enter: the spurious correlation

http://www.tylervigen.com/spurious-correlations
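Why more data alone does not help can be shown with pure noise: compare enough unrelated variables and some pairs will correlate strongly by chance. The sketch below uses simulated random numbers; the sizes and the seed are arbitrary choices.

```r
set.seed(2019)
n_series <- 200   # number of unrelated variables
n_obs    <- 10    # observations per variable (e.g. ten yearly values)

# Pure noise: no variable is related to any other.
x <- matrix(rnorm(n_series * n_obs), nrow = n_obs)

r <- cor(x)
diag(r) <- 0      # ignore each variable's correlation with itself
best <- which(abs(r) == max(abs(r)), arr.ind = TRUE)[1, ]

cat("Pairs compared:", choose(n_series, 2), "\n")
cat("Strongest correlation found:", round(r[best[1], best[2]], 2), "\n")
```

With short series and thousands of comparisons, the strongest pair looks impressive even though nothing is going on.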

Ongoing work on trend detection

  • cryptocurrency fraud detection (coin-level)
  • shifts in abusive language (individual-level)

Crypto fraud detection

Kamps & Kleinberg, 2018

Abusive language

Kleinberg, van der Vegt, Gill (forthcoming)

Recap

  • Some useful resources
  • Information extraction problem
  • Pitfalls of data science
  • Promise in NLP

Advice of the year: Learn how to code.

Note

Full module on Data Science

https://github.com/ben-aaron188/ucl_aca_20182019

APIs/Webscraping, Text mining, Machine learning

“I would not be at all surprised if earthquakes are just practically, inherently unpredictable…. You never know; some silver bullet could come along and prove useful.”

(Ned Field)

If you only read one book in 2019…

Read: “The Signal and the Noise”, Nate Silver

END.